InterPreTS is a tool designed to predict protein-protein interactions based on tertiary structure. To use it, we need FASTA records for each protein of interest. The greedy approach would be to get data for all known human proteins from the gene database. These were already found when working on the Gene Ontology features.

Again, using the NCBI gene database and Biopython it's possible to find FASTA-formatted records for each of these proteins. Then the tool can be downloaded from the InterPreTS website and its predictions used as features.

Finding FASTA records for human proteins

Loading in the list of known human proteins:


In [1]:
cd ../../geneconversion/


/home/gavin/Documents/MRes/geneconversion

It turns out that other people have tried to do this before, and it can be done using the gene2refseq conversion table and the batch Entrez tool here. Instead of using the script defined there, Python can do the mapping from Entrez Gene ID to protein accession number:


In [3]:
import csv

In [4]:
# open gene2refseq, filter lines by taxonomy ID, then map each
# Gene ID (column 1) to its list of protein gi numbers (column 6)
f = open("gene2refseq")
c = csv.reader(f, delimiter="\t")
geneidtoprotein = {}
for line in c:
    # column 0 is the taxonomy ID; 9606 is human
    if line[0] == "9606":
        try:
            geneidtoprotein[line[1]] += [line[6]]
        except KeyError:
            geneidtoprotein[line[1]] = [line[6]]
f.close()

Distribution of protein IDs for each Gene ID

There are multiple protein IDs for each Gene ID due to isoforms, duplicate entries and invalid entries:


In [5]:
counts = []
for k in geneidtoprotein.keys():
    counts.append(len(geneidtoprotein[k]))

In [6]:
h=hist(counts, bins=arange(0,50)+0.5, range=(0,50))
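For reference, hist and arange above come from the pylab namespace. The same distribution can be summarised in plain Python with collections.Counter; a minimal sketch, using a toy dictionary in place of the real geneidtoprotein:

```python
from collections import Counter

# toy stand-in for geneidtoprotein: Gene ID -> list of protein gi numbers
geneidtoprotein = {"1": ["a"], "2": ["b", "c"], "3": ["d", "e"]}

# count how many Gene IDs have 1 protein ID, 2 protein IDs, and so on
counts = Counter(len(v) for v in geneidtoprotein.values())
print(counts)
```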


Retrieving protein accession numbers

First, all the protein accession numbers, minus duplicates, must be saved to a file:


In [12]:
proteinids = list(flatten(geneidtoprotein.values()))
print "Before removing duplicates %i protein gi IDs"%len(proteinids)
proteinids = set(proteinids)
print "After removing duplicates %i protein gi IDs"%len(proteinids)


Before removing duplicates 484877 protein gi IDs
After removing duplicates 72927 protein gi IDs

In [13]:
f = open("human.proteingi.txt","w")
for p in proteinids:
    f.write("{0}\n".format(p))
f.close()

Then this file is submitted to the batch Entrez service, specifying Protein as the database. Trying this, 72927 IDs proved too large for the service; it recommends downloading all proteins from the NCBI FTP service instead. Looking around, it appears the file we want is here.

It's described in the README as:

The RNA and protein directories provide sequence files in three formats representing all of the mRNA, non-coding transcript, and protein model reference sequences (RefSeq) exported as part of the genome annotation process.

Downloaded this and gunzipped it. Now the FASTA file has to be parsed with Biopython and each record matched to its protein gi number:


In [17]:
from Bio import Entrez,SeqIO
Entrez.email = "gavingray1729@gmail.com"

In [50]:
proteinrecords = list(SeqIO.parse("protein.fa","fasta"))

In [40]:
print proteinrecords[0]


ID: gi|53828740|ref|NP_001005484.1|
Name: gi|53828740|ref|NP_001005484.1|
Description: gi|53828740|ref|NP_001005484.1| olfactory receptor 4F5 [Homo sapiens]
Number of features: 0
Seq('MVTEFIFLGLSDSQELQTFLFMLFFVFYGGIVFGNLLIVITVVSDSHLHSPMYF...VKF', SingleLetterAlphabet())

In [52]:
proteintorecord = {}
for r in proteinrecords:
    # the second pipe-delimited field of the ID line is the gi number
    accnum = r.id.split("|")[1]
    #use that to index this record
    proteintorecord[accnum] = r

In [56]:
genetorecord = {}
for k in geneidtoprotein:
    for p in geneidtoprotein[k]:
        # skip protein IDs with no record in the FASTA file
        if p in proteintorecord:
            genetorecord.setdefault(k, []).append(proteintorecord[p])

In [59]:
# annotate each record's description with its Entrez Gene ID
for k in genetorecord.keys():
    for r in genetorecord[k]:
        r.description += " Gene ID: {0}".format(k)

In [62]:
# write new FASTA file
SeqIO.write(flatten(genetorecord.values()),"human.protein.fasta","fasta")


Out[62]:
295326

Mapping Gene ID pairs to FASTA pairs

Each Gene ID pair will map to multiple FASTA pairs, and each of these will have to be evaluated by the InterPreTS tool. Different isoforms are likely to interact differently, as shown in this paper. Trying an example pair:
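Enumerating the FASTA pairs for a single Gene ID pair is the Cartesian product of the two record lists. A minimal sketch with itertools.product, using placeholder strings in place of the SeqRecord objects held in genetorecord:

```python
from itertools import product

# placeholder record lists for one gene pair; in the notebook these are
# lists of Bio.SeqRecord objects taken from genetorecord
genetorecord = {"114787": ["record_a1", "record_a2"], "55839": ["record_b1"]}

pair = ("114787", "55839")
# pair every isoform record of the first gene with every record of the second
fastapairs = list(product(genetorecord[pair[0]], genetorecord[pair[1]]))
print(len(fastapairs))  # 2
```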


In [67]:
pair = (genetorecord.keys()[0],genetorecord.keys()[100])

In [68]:
print pair


('114787', '55839')

Writing a record from each of these identifiers to file:


In [76]:
SeqIO.write([genetorecord[pair[0]][0], genetorecord[pair[1]][0]],"example.pair.fasta","fasta")


Out[76]:
2

The web tool did not return anything: no hits. Trying again with more proteins:


In [73]:
examplelist = []
for k in genetorecord.keys()[:20]:
    examplelist.append(genetorecord[k][0])

In [75]:
SeqIO.write(examplelist,"example.group.fasta","fasta")


Out[75]:
20

This produces some results, which will be viewable for the next six weeks here.


In [77]:
import scipy.misc

In [78]:
scipy.misc.comb(len(genetorecord.keys()),2)


Out[78]:
202055253.0
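As a sanity check on Out[78] without scipy: the number of unordered pairs of n items is n(n-1)/2, and the figure above implies 20103 genes had at least one matched record:

```python
# number of unordered pairs from n items: n choose 2 = n*(n-1)/2
def npairs(n):
    return n * (n - 1) // 2

# 202055253 pairs (Out[78]) corresponds to 20103 genes with matched records
print(npairs(20103))  # 202055253
```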

Saving the dictionaries

There are problems with the Perl script needed to finish the job here.

The FASTA mapping could still be useful, though, so it's worth saving. Saving it in two forms:

  1. Flat table file
  2. pickled dictionary

In [79]:
import pickle

In [80]:
# saving the table, one tab-separated Gene ID/protein ID pair per line
f = open("gene.to.protein.human.txt","w")
for k in geneidtoprotein:
    for p in geneidtoprotein[k]:
        f.write("{0}\t{1}\n".format(k,p))
f.close()

In [81]:
#pickling the dictionaries
f = open("gene.to.protein.pickle","wb")
pickle.dump(geneidtoprotein,f)
f.close()
f = open("gene.to.record","wb")
pickle.dump(genetorecord,f)
f.close()
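To confirm the pickle round-trips, it can be reloaded and compared against the original dictionary. A minimal sketch using a temporary file and a toy dictionary in place of the real mapping:

```python
import os
import pickle
import tempfile

# toy stand-in for the Gene ID -> protein gi mapping saved above
geneidtoprotein = {"114787": ["53828740"], "55839": ["767953"]}

# dump and reload the dictionary, as done for gene.to.protein.pickle
path = os.path.join(tempfile.mkdtemp(), "gene.to.protein.pickle")
with open(path, "wb") as f:
    pickle.dump(geneidtoprotein, f)
with open(path, "rb") as f:
    reloaded = pickle.load(f)
print(reloaded == geneidtoprotein)  # True
```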